Maximum Likelihood Estimation for the Multinomial

Maximum likelihood as a special case of Bayesian estimation

We can obtain the maximum likelihood estimate for $\mu_k$ based on $N$ throws of a $K$-sided die within the Bayesian framework by letting the prior for $\mu$ approach a uniform distribution. For a Dirichlet prior $\mathrm{Dir}(\mu | \alpha)$, this corresponds to setting $\alpha \rightarrow (1, 1, \dots, 1)$. The posterior is then proportional to the likelihood, so its mode (the MAP estimate) coincides with the maximum likelihood estimate.

Prove for yourself that

$$\begin{align*} \hat{\mu}_k &= \arg\max_{\mu_k} p(D|\mu) = \frac{m_k}{N}\,. \end{align*}$$
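A numerical sanity check can complement the proof. The sketch below assumes a made-up count vector and compares the log-likelihood at $m_k/N$ with the log-likelihood at many random points on the probability simplex; the function name `log_likelihood` and the data are illustrative assumptions, not part of the original material.

```python
import math
import random

def log_likelihood(mu, counts):
    # log p(D|mu) = sum_k m_k * log(mu_k); faces with m_k = 0 contribute nothing
    return sum(m * math.log(p) for m, p in zip(counts, mu) if m > 0)

counts = [3, 1, 0, 2, 4, 2]        # hypothetical count vector m for K = 6
N = sum(counts)
mu_ml = [m / N for m in counts]    # candidate ML estimate m_k / N

# Evaluate the log-likelihood at random points on the probability simplex
rng = random.Random(0)
best_random = -math.inf
for _ in range(10_000):
    w = [rng.random() + 1e-12 for _ in counts]
    total = sum(w)
    mu = [x / total for x in w]
    best_random = max(best_random, log_likelihood(mu, counts))

# No random candidate should beat the ML estimate
print(log_likelihood(mu_ml, counts) >= best_random)
```

A random search is of course no proof; it only makes a counterexample unlikely.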

Model specification

data-generating distribution

The outcomes $x_n$ are encoded as

$$x_{nk} = \begin{cases} 1 & \text{if the $n$-th throw landed on $k$-th face}\\ 0 & \text{otherwise} \end{cases}$$

and the likelihood function for $\mu$ is now

$$p(D|\mu) = \prod_n \prod_k \mu_k^{x_{nk}} = \prod_k \mu_k^{\sum_n x_{nk}} = \prod_k \mu_k^{m_k} \tag{B-2.29}$$

where $m_k= \sum_n x_{nk}$ is the number of throws that landed on face $k$. The vector $m = (m_1,m_2, \ldots, m_K)^T$ is known as the count vector. Note that $\sum_k m_k = N$.

This distribution depends on the observations only through the observed counts $\{m_k\}$. For given counts $\{m_k\}$, $p(D|\mu)$ can be interpreted as a likelihood function for $\mu$.
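To make the reduction to counts concrete, here is a minimal sketch with hypothetical one-hot data for $K=3$. It evaluates the likelihood once as a product over all throws and faces, and once from the count vector alone; the data matrix `X` and the value of `mu` are made up for illustration.

```python
import math

# Hypothetical one-hot data: N = 5 throws of a K = 3 sided die
X = [
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
]
mu = [0.5, 0.2, 0.3]

# Count vector m_k = sum_n x_nk (column sums of X)
m = [sum(col) for col in zip(*X)]

# Likelihood as a product over throws n and faces k ...
lik_per_throw = math.prod(
    mu_k ** x_nk for x_n in X for mu_k, x_nk in zip(mu, x_n)
)
# ... and as a product over faces, using only the counts
lik_counts = math.prod(mu_k ** m_k for mu_k, m_k in zip(mu, m))

print(m, lik_per_throw, lik_counts)
```

Both expressions agree up to floating-point rounding, which is why the counts are all we need to keep from the data.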

The Categorical Distribution

Consider a toss with a $K$-sided die. We use a one-hot coding scheme, i.e., the outcome is encoded as

$$x_{k} = \begin{cases} 1 & \text{if the throw landed on $k$-th face}\\ 0 & \text{otherwise} \end{cases} \,.$$

Assume the probabilities

$$p(x_{k}=1) = \mu_k \quad \text{with } \mu_k \geq 0 \text{ and }\sum_k \mu_k = 1 \,.$$

The data generating distribution for a one-hot encoded outcome $x = (x_{1},x_{2},\ldots,x_{K})^T$ (with $\mu = (\mu_1,\mu_2,\dots,\mu_K)^T$) is then given by

$$p(x|\mu) = \mu_1^{x_1} \mu_2^{x_2} \cdots \mu_K^{x_K}=\prod_{k=1}^K \mu_k^{x_k} \tag{B-2.26}$$

This generalized Bernoulli distribution is called the categorical distribution.
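As an illustration, the product form can be evaluated directly. The sketch below uses a hypothetical helper `categorical_pmf` and made-up probabilities for a six-sided die; for a one-hot vector the product simply picks out the $\mu_k$ of the face that came up.

```python
def categorical_pmf(x, mu):
    # p(x|mu) = prod_k mu_k^{x_k}; for one-hot x this selects a single mu_k
    p = 1.0
    for x_k, mu_k in zip(x, mu):
        p *= mu_k ** x_k
    return p

mu = [0.1, 0.2, 0.3, 0.1, 0.1, 0.2]   # hypothetical face probabilities
x = [0, 0, 1, 0, 0, 0]                # the die landed on the 3rd face
print(categorical_pmf(x, mu))          # 0.3
```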

prior distribution

Next, we need a prior for the parameters $\mu = (\mu_1,\mu_2,\ldots,\mu_K)^T$.

In the binary coin toss example, we used a beta distribution that was conjugate with the binomial and forced us to choose prior pseudo-counts.

The generalization of the beta prior to $K$ parameters $\{\mu_k\}$ is the Dirichlet distribution:

$$p(\mu|\alpha) = \mathrm{Dir}(\mu|\alpha) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\Gamma(\alpha_1)\cdots \Gamma(\alpha_K)} \prod_{k=1}^K \mu_k^{\alpha_k-1} $$

where $\Gamma(\cdot)$ is the Gamma function.

The Gamma function can be interpreted as a generalization of the factorial function to the real numbers ($\mathbb{R}$). If $n$ is a natural number ($1,2,3, \ldots $), then $\Gamma(n) = (n-1)!$, where $(n-1)! = (n-1)\cdot (n-2) \cdots 1$.

As before for the Beta distribution in the coin toss experiment, you can interpret $\alpha_k$ as the prior number of (pseudo-)observations that the die landed on the $k$-th face.
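The density above can be evaluated with the standard library alone. The following sketch works in the log domain via `math.lgamma` to avoid overflow for large $\alpha_k$; the function name `dirichlet_logpdf` and the test values are assumptions for illustration, and the input `mu` is assumed to lie strictly inside the simplex.

```python
import math

def dirichlet_logpdf(mu, alpha):
    # log Dir(mu|alpha) = log Gamma(sum_k alpha_k) - sum_k log Gamma(alpha_k)
    #                     + sum_k (alpha_k - 1) * log(mu_k)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1) * math.log(p) for a, p in zip(alpha, mu))

# Gamma(n) = (n-1)! for natural numbers n
print(math.gamma(5), math.factorial(4))

# With alpha = (1, 1, 1) the density is constant on the simplex
print(math.exp(dirichlet_logpdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])))

# For K = 2 the Dirichlet reduces to the beta density in mu_1
print(math.exp(dirichlet_logpdf([0.4, 0.6], [2.0, 3.0])))
```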

Prediction of next toss for the loaded die

Let's apply what we have learned about the loaded die to compute the probability that we throw the $k$-th face at the next toss.

$$\begin{align*} p(x_{\bullet,k}=1|D) &= \int p(x_{\bullet,k}=1|\mu)\,p(\mu|D) \,\mathrm{d}\mu \\ &= \int \mu_k \times \mathrm{Dir}(\mu|\,\alpha+m) \,\mathrm{d}\mu \\ &= \mathrm{E}\left[ \mu_k | D\right] \\ &= \frac{m_k + \alpha_k }{ N+ \sum_k \alpha_k} \end{align*}$$

(You can find the mean of the Dirichlet distribution, $\mathrm{E}\left[ \mu_k \right]$, on its Wikipedia page.)

This result is simply a generalization of Laplace's rule of succession: for $K=2$ and $\alpha = (1,1)$ it reduces to $(m_1+1)/(N+2)$.
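The predictive formula is easy to try out numerically. The sketch below assumes made-up counts and a uniform prior; the function name `predictive` is hypothetical.

```python
def predictive(counts, alpha):
    # p(x_k = 1 | D) = (m_k + alpha_k) / (N + sum_k alpha_k)
    total = sum(counts) + sum(alpha)
    return [(m + a) / total for m, a in zip(counts, alpha)]

counts = [3, 1, 0, 2, 4, 2]    # hypothetical counts for a six-sided die
alpha = [1.0] * 6              # uniform prior, one pseudo-count per face
p = predictive(counts, alpha)
print(p)

# Binary case with a uniform prior: 7 heads in 10 tosses
p_heads = predictive([7, 3], [1.0, 1.0])[0]
print(p_heads)
```

Note that the face with zero observations still gets a nonzero predictive probability, because the prior pseudo-count keeps it in play.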

Categorical, Multinomial and Related Distributions

In the above derivation, we noticed that the data generating distribution for $N$ die tosses with data outcomes $D=\{x_1,\ldots,x_N\}$ only depends on the counts $m_k$:

$$p(D|\mu) = \prod_n \underbrace{\prod_k \mu_k^{x_{nk}}}_{\text{categorical dist.}} = \prod_k \mu_k^{\sum_n x_{nk}} = \prod_k \mu_k^{m_k} \tag{B-2.29}$$

Now consider a $K$-sided "coin", e.g., a six-faced die. How should we encode outcomes? Two natural options present themselves:

Option 1: label encoding

$$x \in \{1,2,\ldots,K\} \,.$$

E.g., for $K=6$, if the die lands on the 3rd face, then $x=3$.
This coding scheme is called label (or index) encoding.

Option 2: one-hot encoding

$$x = (x_1,\ldots,x_K)^T $$

where $x_k$ are binary selection variables, given by

$$x_k = \begin{cases} 1 & \text{if die landed on $k$th face}\\ 0 & \text{otherwise} \end{cases}$$

For instance, for $K=6$, if the die lands on the 3rd face, then $x=(0,0,1,0,0,0)^T$.
This coding scheme is called a 1-of-K or one-hot coding scheme.

It turns out that the one-hot coding scheme is mathematically more convenient!
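The two encodings are easy to convert into each other. The following sketch uses hypothetical helper names (`one_hot`, `to_label`) and made-up throws; it also hints at the convenience of one-hot vectors, since adding them up yields the count of each face directly.

```python
def one_hot(label, K):
    # label is 1-based: label = 3 means the die landed on the 3rd face
    if not 1 <= label <= K:
        raise ValueError("label must be in 1..K")
    return [1 if k == label else 0 for k in range(1, K + 1)]

def to_label(x):
    # Inverse mapping: position of the single 1, counted from 1
    return x.index(1) + 1

print(one_hot(3, 6))              # [0, 0, 1, 0, 0, 0]

labels = [3, 1, 3, 6]             # hypothetical throws in label encoding
encoded = [one_hot(l, 6) for l in labels]
counts = [sum(col) for col in zip(*encoded)]
print(counts)                     # [1, 0, 2, 0, 0, 1]
```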

Discrete Data: the 1-of-K Coding Scheme

Consider a coin-tossing experiment with outcomes $x \in\{0,1\}$ (tails and heads, respectively) and let $0\leq \mu \leq 1$ represent the probability of heads. The data generating distribution for this model can be written as a Bernoulli distribution:

$$ p(x|\mu) = \mu^{x}(1-\mu)^{1-x}$$

Note that the variable $x$ acts as a (binary) selector for the tail or head probabilities. Think of this as an 'if'-statement in programming.
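The selector reading can be spelled out in code. In this small sketch (function names are illustrative), the exponent form and the explicit branch return the same value for both outcomes.

```python
def bernoulli_pmf(x, mu):
    # Exponent form: x selects mu, (1 - x) selects (1 - mu)
    return mu ** x * (1 - mu) ** (1 - x)

def bernoulli_if(x, mu):
    # The same distribution written as an explicit branch
    if x == 1:
        return mu
    return 1 - mu

mu = 0.3
for x in (0, 1):
    print(x, bernoulli_pmf(x, mu), bernoulli_if(x, mu))
```

The exponent form has the advantage that it is a single expression in $x$, which makes products over many observations easy to manipulate.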


Inference for $\{\mu_k\}$

The posterior for $\{\mu_k\}$ can be obtained through Bayes' rule:

$$\begin{align*} p(\mu|D,\alpha) &\propto p(D|\mu) \cdot p(\mu|\alpha) \\ &\propto \prod_k \mu_k^{m_k} \cdot \prod_k \mu_k^{\alpha_k-1} \\ &= \prod_k \mu_k^{\alpha_k + m_k -1}\\ &\propto \mathrm{Dir}\left(\mu\,|\,\alpha + m \right) \tag{B-2.41} \\ &= \frac{\Gamma\left(\sum_k (\alpha_k + m_k) \right)}{\Gamma(\alpha_1+m_1) \Gamma(\alpha_2+m_2) \cdots \Gamma(\alpha_K + m_K)} \prod_{k=1}^K \mu_k^{\alpha_k + m_k -1} \end{align*}$$

where $m = (m_1,m_2,\ldots,m_K)^T$ is the count vector.
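In practice, this conjugate update amounts to adding the count vector to the prior parameters. The sketch below assumes made-up prior pseudo-counts and two hypothetical batches of counts; it also illustrates that processing the data in batches leads to the same posterior as processing everything at once.

```python
def update(alpha, counts):
    # Dir(mu|alpha) prior + counts m  ->  Dir(mu|alpha + m) posterior
    return [a + m for a, m in zip(alpha, counts)]

alpha0 = [1.0, 1.0, 1.0]     # hypothetical prior pseudo-counts, K = 3
batch1 = [2, 0, 1]           # hypothetical counts from a first batch
batch2 = [1, 3, 0]           # hypothetical counts from a second batch

all_at_once = update(alpha0, [a + b for a, b in zip(batch1, batch2)])
sequential = update(update(alpha0, batch1), batch2)
print(all_at_once == sequential)   # True
```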


Solution

$$\begin{align*} p(&x_{\bullet,k}=1|D) = \frac{m_k + \alpha_k }{ N+ \sum_k \alpha_k} \\ &= \frac{m_k}{N+\sum_k \alpha_k} + \frac{\alpha_k}{N+\sum_k \alpha_k}\\ &= \frac{m_k}{N+\sum_k \alpha_k} \cdot \frac{N}{N} + \frac{\alpha_k}{N+\sum_k \alpha_k}\cdot \frac{\sum_k \alpha_k}{\sum_k\alpha_k} \\ &= \frac{N}{N+\sum_k \alpha_k} \cdot \frac{m_k}{N} + \frac{\sum_k \alpha_k}{N+\sum_k \alpha_k} \cdot \frac{\alpha_k}{\sum_k\alpha_k} \\ &= \frac{N}{N+\sum_k \alpha_k} \cdot \frac{m_k}{N} + \bigg( \frac{\sum_k \alpha_k}{N+\sum_k \alpha_k} + \underbrace{\frac{N}{N+\sum_k \alpha_k} - \frac{N}{N+\sum_k \alpha_k}}_{0}\bigg) \cdot \frac{\alpha_k}{\sum_k\alpha_k} \\ &= \frac{N}{N+\sum_k \alpha_k} \cdot \frac{m_k}{N} + \bigg( 1 - \frac{N}{N+\sum_k \alpha_k}\bigg) \cdot \frac{\alpha_k}{\sum_k\alpha_k} \\ &= \underbrace{\frac{\alpha_k}{\sum_k\alpha_k}}_{\text{prior prediction}} + \underbrace{\frac{N}{N+\sum_k \alpha_k} \cdot \underbrace{\left(\frac{m_k}{N} - \frac{\alpha_k}{\sum_k\alpha_k}\right)}_{\text{prediction error}}}_{\text{data-based correction}} \end{align*}$$

(If you know how to do it shorter and more elegantly, please post in Piazza.)

This decomposition is the natural consequence of doing Bayesian estimation, which always involves a prior-based prediction term and a likelihood-based (or data-based) correction term that can be interpreted as a (precision-weighted) prediction error.
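The decomposition can be checked numerically for a single face $k$. The sketch below uses made-up numbers; the variable name `gain` for the weight $N/(N+\sum_k \alpha_k)$ is our own label, not terminology from the text, and $N > 0$ is assumed.

```python
def predictive_direct(m_k, alpha_k, N, alpha_sum):
    # (m_k + alpha_k) / (N + sum_k alpha_k)
    return (m_k + alpha_k) / (N + alpha_sum)

def predictive_decomposed(m_k, alpha_k, N, alpha_sum):
    prior_prediction = alpha_k / alpha_sum        # prediction before seeing data
    gain = N / (N + alpha_sum)                    # weight given to the data
    prediction_error = m_k / N - prior_prediction
    return prior_prediction + gain * prediction_error

# Hypothetical numbers: face k seen 4 times in 12 throws, alpha_k = 1, K = 6
direct = predictive_direct(4, 1.0, 12, 6.0)
decomposed = predictive_decomposed(4, 1.0, 12, 6.0)
print(direct, decomposed)
```

As $N$ grows, the weight on the prediction error approaches 1 and the prediction moves from the prior prediction toward the empirical frequency $m_k/N$.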

The Multinomial Distribution. Author: BMLIP (https://github.com/bmlip). Bayesian and maximum likelihood density estimation for discretely valued data sets.
